Expressive TTS Training With Frame and Style Reconstruction Loss

Authors

Abstract

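We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system that improves speech styling at the utterance level. One of the key challenges in prosody modeling is the lack of reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations from the training data. It does not attempt to model prosody explicitly either, but rather encodes the association between the input text and its prosody styles using a Tacotron-based TTS framework. This study marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. It adopts a combination of two objective functions: 1) a frame-level reconstruction loss, calculated between the synthesized and target spectral features; 2) an utterance-level style reconstruction loss, calculated between the deep style features of the synthesized and target speech. The style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms the state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function in Tacotron training for improved expressiveness.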
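
The combination of the two objectives named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not state the distance measure (mean squared error is used below), the weighting between the two terms (`style_weight` is hypothetical), or the style feature extractor (`mean_pool_style` is a toy stand-in for a deep style feature network).

```python
from typing import Callable, List, Sequence

Frames = Sequence[Sequence[float]]  # spectral features indexed as [time][bin]


def frame_reconstruction_loss(pred: Frames, target: Frames) -> float:
    """Frame-level term: mean squared error over every frame and bin (assumed)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def mean_pool_style(frames: Frames) -> List[float]:
    """Hypothetical style extractor: one utterance-level vector (time average)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def style_reconstruction_loss(
    pred: Frames, target: Frames, extractor: Callable[[Frames], List[float]]
) -> float:
    """Utterance-level term: distance between style features, not raw frames."""
    style_pred, style_target = extractor(pred), extractor(target)
    diffs = [(a - b) ** 2 for a, b in zip(style_pred, style_target)]
    return sum(diffs) / len(diffs)


def total_loss(
    pred: Frames,
    target: Frames,
    extractor: Callable[[Frames], List[float]] = mean_pool_style,
    style_weight: float = 1.0,  # assumed weighting, not given in the abstract
) -> float:
    """Sum of the frame-level and weighted utterance-level terms."""
    frame_term = frame_reconstruction_loss(pred, target)
    style_term = style_reconstruction_loss(pred, target, extractor)
    return frame_term + style_weight * style_term


pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 2.0], [3.0, 6.0]]
print(total_loss(pred, target, style_weight=2.0))  # 2.0
```

In this toy case the frame-level term is 1.0 and the style term is 0.5, so a weight of 2.0 gives 2.0. In a perceptual-loss setup the extractor would typically be a pretrained network kept fixed during TTS training, so that only the synthesizer is updated.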

Similar resources

Semantics and Discourse Processing for Expressive TTS

In this paper we present ongoing work to produce an expressive TTS reader that can be used both in text and dialogue applications. The system has been previously used to read (English) poetry and it has now been extended to apply to short stories. The text is fully analyzed both at phonetic and phonological level, and at syntactic and semantic level. The core of the system is the Prosodic Manag...

Designing speech database with prosodic variety for expressive TTS system

For the purpose of building a speech synthesis system that can generate high-quality speech with a wide range of prosody and realize fine prosody control, we propose a new method for constructing a speech database. As the speech synthesis method, we select a hybrid system that consists of two parts: a speech unit selection part and a prosody modification part by STRAIGHT (vocoder type high quality analysis-synthesis...

Adding speaking style to a TTS system

This paper aims to enhance the performance of a TTS system by generating various speaking styles. First we describe three speaking styles (Radio News, Political Address and Conversation) and compare the prosodic features found in these authentic styles with the prosody in “neutral” speech uttered by the eLite TTS system ([1]). Differences concern about 20 prosodic characteristics (F0 span, spee...

Emotional Style Conversion in the TTS System with Cepstral Description

This contribution describes experiments with emotional style conversion performed on the utterances produced by the Czech and Slovak text-to-speech (TTS) system with cepstral description and basic prosody generated by rules. Emotional style conversion was realized as post-processing of the TTS output speech signal, and as a real-time implementation into the system. Emotional style prototypes rep...

Expressive language style among adolescents and adults with Williams syndrome

Language samples elicited through a picture description task were recorded from 38 adolescents and adults with Williams syndrome (WS) and one control group matched on age, and another matched on age, IQ, and vocabulary knowledge. The samples were coded for use of various types of inferences, dramatic devices, and verbal fillers; acoustic analyses of prosodic features were carried out, and an in...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2021

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2021.3076369